Abstract

Background: In population-based observational studies, non-participation and delayed response to the invitation to
participate are complications that often arise during the recruitment of a sample. When these are not properly dealt
with, the composition of the sample can differ from the desired composition. Inviting too many individuals or
too few individuals from a particular subgroup could lead to unnecessary costs or decreased precision. Another 
problem is that there is frequently no or only partial information available about the willingness to participate. In this 
situation, we cannot adjust the recruitment procedure for non-participation before the recruitment period starts. 

Methods: We have developed an adaptive list sequential sampling method that can deal with unknown 
participation probabilities and delayed responses to the invitation to participate in the study. In a sequential way, we 
evaluate whether we should invite a person from the population or not. During this evaluation, we correct for the fact 
that this person could decline to participate using an estimated participation probability. We use the information from 
all previously invited persons to estimate the participation probabilities for the non-evaluated individuals. 

Results: The simulations showed that the adaptive list sequential sampling method can be used to estimate the 
participation probability during the recruitment period, and that it can successfully recruit a sample with a specific 
composition. 

Conclusions: The adaptive list sequential sampling method can successfully recruit a sample with a specific desired 
composition when we have partial or no information about the willingness to participate before we start the 
recruitment period and when individuals may have a delayed response to the invitation. 

Keywords: List sequential sampling, Sample representativeness, πps sample, Population-based observational studies



Background 

Population-based observational studies are frequently 
used to measure the prevalence of characteristics such as 
diseases by means of a sample from a population [1]. Two
important problems that arise when a sample is recruited
are that (i) not everyone in the population has the same
willingness to participate in the study [2-6], and (ii) after
inviting an individual, it might take some time before we
receive a response.

Variation in the willingness to participate may bias the 
results of the study. To deal with this problem, we could 
invite more individuals from groups related to a low
willingness to participate [7]. However, this approach requires
that the participation probability per person or group 
is known before the sampling procedure starts. Unfor- 
tunately, this detailed knowledge on the willingness to 
participate among sub-groups in the population is often 
not available. If the willingness to participate is lower than
assumed, we will invite too few individuals, which leads to
a sample that is too small and to decreased precision. On the other
hand, inviting too many individuals will lead to extra costs.
Generally, we invite too many individuals when we underestimate
the willingness to participate and there is delayed
response to the invitation. In general, not accounting for
delayed response will lead to an unexpected number of extra
individuals in the sample at the end of the recruitment
period.

An example of a complex sampling problem is observed
in the HELIUS study [8]. One objective of the HELIUS
study is to measure ethnic inequalities in the incidence
and prognosis of major diseases in the population
of Amsterdam. The desired sample should
have approximately 5000 individuals in each ethnic
group, and should be representative of the population
of Amsterdam. This is achieved by stratifying
on the auxiliary variables place of residence (spatial),
age (continuous), gender (categorical), and social
economic status (categorical), available from municipal
registries.

Unfortunately, it is not straightforward to implement 
stratification when we have a large number of auxiliary 
variables of mixed types [9]. In this case, too small or even 
empty strata might be obtained when we cross all strata 
from all variables. An alternative variance reduction technique,
proposed by Grafström et al., is to obtain a well
spread set of participants [10,11]. Basically, a set of participants
is well spread when the number of participants is
close to what is expected on average, for every set of auxiliary
variables. Grafström et al. showed that the variance
of commonly used estimators is usually low with a well
spread set of participants.

In this paper, we use the list sequential method, developed
by Bondesson and Thorburn [12], to obtain a well
spread set of participants without replacement from a
finite population. Instead of trying to cross all strata from
all auxiliary variables, our approach is based on a distance
function between individuals. Similar or almost similar
individuals should seldom both be invited to participate
in the study. In its current form, the list sequential sampling
method cannot be used to recruit sets of participants
for population-based observational studies because the
list sequential sampling method assumes that (i) everyone
participates in the study and that (ii) there is no delayed
response to the invitation.

We developed approaches to correct for non- 
participation and delayed response to the invitation 
when we use a list sequential sampling method. The list 
sequential sampling method evaluates individuals from 
the population in a sequential order, and uses a random 
process to decide whether or not an individual should be 
invited to participate in the study. In this decision we have 
to correct for any non-participation. One approach is to
weight the probability of being invited with the (estimated)
participation probability. When there is no or partial a-priori
knowledge on the participation probability, we can
estimate this probability during the recruitment period
using the information from already invited individuals. To
combine both prior information and information that is
generated during the recruitment period, we developed a
Bayesian approach to estimate the participation probabilities.
Moreover, to deal with the delayed response to the
invitation, we use the expected response of an individual
when we have no answer yet.

We performed a simulation study to illustrate the per- 
formance of the adapted list sequential sampling method, 
when we have unknown heterogeneous participation 
probabilities and delayed response to the invitation. 

Methods 

Problem description 

We consider a finite population $D$ containing $n$ individuals,
where each individual $i$ is described by a vector $x_i$ of
auxiliary variables. The auxiliary variables $x_i$ are known
for each individual before the recruitment period starts.
Usually $x_i$ is available from municipal or national person
registries. Examples of these variables are gender, age,
place of residence, and social economic status. In addition
to $x_i$, each individual $i$ has an unobserved outcome of
interest $y_i$. The goal of this paper is to obtain a sample of
size $m$ ($m < n$) from $D$, in which we can observe $y_i$.

A sample is described by the vector $s = (s_1, \ldots, s_n)$,
where $s_i$ takes the value 1 if individual $i$ is in the sample
and 0 otherwise [13,14]. With this representation there
are $2^n$ possible samples. Before the recruitment period
starts we need to determine $\pi_i$, which is the probability
that individual $i$ is included in $s$ (i.e. $p(s_i = 1) = \pi_i$). We
want to recruit a sample of $m$ individuals and therefore
$\sum_{i=1}^{n} \pi_i = m$, where $m$ is a positive integer.

Different choices can be made for the inclusion probabilities
$\pi_i$. For instance, we can assign equal inclusion
probabilities to all individuals, i.e. $\pi_i = m/n$. In this case,
the sample $s$ is expected to be a 'miniature' version of
the population $D$, because we expect $s$ to have approximately
the same composition of auxiliary characteristics
as $D$. In this case, the sample is referred to as a representative
sample [11]. However, $\pi_i$ is frequently chosen to be
proportional to $x_i$. For example, by oversampling a rare
subgroup we could increase the precision of the result for
that particular subgroup [15].

List sequential sampling method 

To obtain the sample we use the list sequential method 
based on sampling without replacement developed by 
Bondesson and Thorburn [12]. To illustrate the list 
sequential method, we first consider the situation in which 
all invited individuals will participate in the study. 

During the recruitment period, we sequentially decide
for each individual $i$ from $D$ whether we include this individual
in the sample ($s_i = 1$) or not ($s_i = 0$). After this
decision, the probability of being included in the sample
for the remaining non-invited individuals from $D$ is
updated. Let $\pi^{(0)} = (\pi_1^{(0)}, \ldots, \pi_n^{(0)})$ be the vector of initial
inclusion probabilities, which is determined before the
sampling procedure starts, i.e. $\pi_i^{(0)} = \pi_i$. We sequentially
evaluate each individual $i$ from the population and update
the inclusion probabilities of all non-evaluated individuals
after each evaluation. For the first individual, we
have $p(s_1 = 1) = \pi_1^{(0)}$. Depending on whether individual 1
is included in the sample or not, the inclusion probabilities
of all other, non-evaluated, individuals are updated.
This gives us the vector $\pi^{(1)}$, from which we use $\pi_2^{(1)}$ to
determine $s_2$, i.e. decide whether to include the second
individual in the sample or not. The updating scheme can
be represented as

$$
\begin{aligned}
\pi^{(0)} &= (\pi_1^{(0)}, \pi_2^{(0)}, \pi_3^{(0)}, \ldots, \pi_n^{(0)}) \\
\pi^{(1)} &= (s_1, \pi_2^{(1)}, \pi_3^{(1)}, \ldots, \pi_n^{(1)}) \\
\pi^{(2)} &= (s_1, s_2, \pi_3^{(2)}, \ldots, \pi_n^{(2)}) \\
\pi^{(3)} &= (s_1, s_2, s_3, \pi_4^{(3)}, \ldots, \pi_n^{(3)}) \\
&\;\;\vdots
\end{aligned}
$$

Generally, when we evaluate individual $i$, we will use the
inclusion probability $\pi_i^{(i-1)}$ to determine $s_i$. After the
evaluation of individual $i$, we update all probabilities $\pi_j$ for
$j > i$ with

$$\pi_j^{(i)} = \pi_j^{(i-1)} - \left(s_i - \pi_i^{(i-1)}\right) w_{j-i}^{(i)}, \qquad (1)$$

where $w_{j-i}^{(i)}$ are weights that may depend on $s_1, s_2, \ldots, s_{i-1}$.
Note that $w_{j-i}^{(i)}$ determines how $\pi_j^{(i)}$ is affected by the
sampling outcome of the $i$th individual, since $w_{j-i}^{(i)}$ influences
the second order inclusion probability $p(s_i = 1, s_j = 1)$.
The sampling scheme gives a sample of size $m$ when the
weights are restricted to sum to one, i.e. $\sum_{j=i+1}^{n} w_{j-i}^{(i)} = 1$.
To guarantee that $0 \le \pi_j^{(i)} \le 1$, all weights should
satisfy

$$-\min\left(\frac{1 - \pi_j^{(i-1)}}{1 - \pi_i^{(i-1)}}, \frac{\pi_j^{(i-1)}}{\pi_i^{(i-1)}}\right) \le w_{j-i}^{(i)} \le \min\left(\frac{\pi_j^{(i-1)}}{1 - \pi_i^{(i-1)}}, \frac{1 - \pi_j^{(i-1)}}{\pi_i^{(i-1)}}\right). \qquad (2)$$

Within these bounds, we can impose different restrictions
on $w_{j-i}^{(i)}$, resulting in samples with certain characteristics.
Generally, when $w_{j-i}^{(i)} > 0$ we have corr$(s_i = 1, s_j = 1) < 0$
(i.e. a negative correlation between the sampling indicators
of individuals $i$ and $j$), whereas with $w_{j-i}^{(i)} < 0$, we
have corr$(s_i = 1, s_j = 1) > 0$. For more detail about the
list sequential method, we refer the reader to Theorem 1
and Remark 1 of Bondesson and Thorburn
[12].
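
The updating rule (1) and the bounds (2) can be illustrated with a minimal Python sketch for the situation in which every invited individual participates. This is not the authors' implementation: the function name `list_sequential_sample` and the callback `weight_fn(i, j)` (the weight for the effect of individual i on a later individual j, with zero-based indices) are assumptions made for illustration.

```python
import numpy as np

def list_sequential_sample(pi0, weight_fn, rng):
    """One pass of list sequential sampling, assuming full participation.

    pi0       : initial inclusion probabilities (length n)
    weight_fn : hypothetical helper, weight_fn(i, j) gives the desired
                weight for the effect of s_i on pi_j, j > i (0-based)
    rng       : numpy random Generator
    """
    pi = np.asarray(pi0, dtype=float).copy()
    n = len(pi)
    s = np.zeros(n, dtype=int)
    for i in range(n):
        # guard against tiny floating point excursions outside [0, 1]
        p_i = min(max(pi[i], 0.0), 1.0)
        s[i] = int(rng.random() < p_i)
        if p_i == 0.0 or p_i == 1.0:
            continue  # outcome was certain, nothing to redistribute
        for j in range(i + 1, n):
            w = weight_fn(i, j)
            # bounds (2): keep the updated probability inside [0, 1]
            upper = min(pi[j] / (1.0 - p_i), (1.0 - pi[j]) / p_i)
            lower = -min(pi[j] / p_i, (1.0 - pi[j]) / (1.0 - p_i))
            w = min(max(w, lower), upper)
            # updating rule (1)
            pi[j] -= (s[i] - p_i) * w
    return s
```

With equal initial probabilities and equal weights that sum to one over the remaining individuals, for example `weight_fn = lambda i, j: 1.0 / (n - i - 1)`, the sketch returns a sample of fixed size, in line with the remark above on weights that sum to one.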

Well spread samples 

We are interested in recruiting a well spread sample
with the list sequential sampling method. Usually, a well
spread sample leads to parameter estimates with low variances.
Before we can introduce the definition of a well
spread sample, we require the concept of coherent subsets.
Let $d(i, k)$ be the distance between individuals $i$ and
$k$. A subset $D'$ from the population $D$ is coherent if the
following holds. First, let some individual $i \in D'$. Individual
$k$ is included in $D'$ if and only if $d(i, k) < r$, where
$r > 0$. Consequently, $D'$ can be constructed by including
all individuals within a ball of radius $r$ around individual $i$.

Grafström and Schelin considered a sample to be well
spread with respect to the inclusion probabilities $\pi$ when,
for every coherent subset $D' \subset D$,

$$n' \approx \sum_{i \in D'} \pi_i, \qquad (3)$$

where $n'$ is the number of sampled individuals in $D'$.
A smaller distance to individual $i$ increases the probability
of being included in the coherent subset $D'$. To satisfy
(3), it is clear that the inclusion probability of individual $i$
should be more influenced by the sampling indicators $s$ of
individuals with a smaller distance. We propose to measure
distance between individuals with the auxiliary variables
$x$, where $d(x_i, x_k)$ is the distance between individuals
$i$ and $k$. Based on the types of auxiliary variables, we can
choose, for instance, the Mahalanobis or the Manhattan
distance.

To obtain a well spread sample with the list sequential
sampling method, we will use preliminary weights which
are specified before the recruitment period starts. The
preliminary weight $\tilde{w}_k^{(i)}$ reflects the effect of $s_k$ from individual
$k$ on the inclusion probability of individual $i$. The
weights are referred to as preliminary because the upper
bound from (2) has an effect on the conditional inclusion
probabilities.

The preliminary weights are constructed in the following
way. Let $c_k^{(i)}$ be the rank of the distance of the $k$th
individual to individual $i$, where $k \ne i$. We rank the distances
in ascending order, where we assign $c_k^{(i)} = 1$ to the
closest individual, $c_k^{(i)} = 2$ to the second closest individual,
and so on. To construct the preliminary weights, we
could use the linear function

$$\tilde{w}_k^{(i)} = \beta + \lambda c_k^{(i)}, \qquad (4)$$

where $\beta$ and $\lambda < 0$ are arbitrarily chosen
weights. The sampling indicator of individual $k$ has a
larger effect on individuals at smaller distance, whereas
it has less effect on individuals at further distance. To
recruit a set of approximately $m$ individuals, we restrict
the weights to satisfy $\sum_{k \ne i} \tilde{w}_k^{(i)} = 1$.
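
The construction of rank-based preliminary weights can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, Euclidean distance is used as a stand-in for the Mahalanobis or Manhattan distance, and the intercept of the linear function is chosen so that the weights for each individual sum to one, which is one way (an assumption here) to meet the restriction stated above.

```python
import numpy as np

def rank_matrix(x):
    """ranks[i, k] = rank of individual k by distance to individual i (1 = closest)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # Euclidean distance as a stand-in
    np.fill_diagonal(dist, np.inf)            # an individual is not its own neighbour
    order = np.argsort(dist, axis=1, kind="stable")
    n = len(x)
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks                              # ranks[i, i] == n is a placeholder

def linear_weights(ranks, lam):
    """Weights that are linear in the rank, with slope lam < 0 and row sums of one."""
    n = ranks.shape[0]
    intercept = 1.0 / (n - 1) - lam * n / 2.0  # makes each row sum to one over k != i
    w = intercept + lam * ranks.astype(float)
    np.fill_diagonal(w, 0.0)
    return w

def nearest_neighbour_weights(ranks, k):
    """Equal weights 1/k for the k nearest neighbours, zero otherwise."""
    w = (ranks <= k) / k
    np.fill_diagonal(w, 0.0)
    return w
```

Note that a strongly negative slope yields negative weights for distant individuals; these remain admissible as long as they respect the bounds in (2).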

Heterogeneous participation probabilities 

A problem of sampling from population $D$ is that individuals
that are invited to participate in the study
can decline the invitation. Let $b = (b_1, \ldots, b_n)$ be
the vector that indicates whether an individual $i$ is
invited to participate ($b_i = 1$) or not ($b_i = 0$).
When individual $i$ refuses to participate in the study, we
have $s_i = 0$ and we do not observe $y_i$. Let $\phi = (\phi_1, \ldots, \phi_n)$
be the vector that contains the participation probability
per person in the population, where $\phi_i = p(s_i = 1 \mid b_i = 1)$.
Note that when every invitee participates (i.e. $\phi_i = 1$ for
$i = 1, \ldots, n$), we have $s = b$.

Hof et al. BMC Medical Research Methodology 2014, 14:81
http://www.biomedcentral.com/1471-2288/14/81

Let $\tilde{\pi}_i^{(i-1)}$ be the inclusion probability $\pi_i^{(i-1)}$ corrected
for non-participation, i.e. the probability of being invited
to participate in the study for individual $i$ from $D$. When
$\phi_i$ is known before the recruitment period starts, non-participation
can be dealt with by using $\tilde{\pi}_i^{(i-1)} = \pi_i^{(i-1)} / \phi_i$
as the probability to invite individual $i$. Moreover, we can
use the updating rule from (1) to update the inclusion
probabilities of the non-evaluated individuals $\pi_j$, $j > i$,
after individual $i$ has responded to the invitation. This will
give us a sample that approximately satisfies the inclusion
probabilities $\pi$.

The following small sampling problem illustrates this
modification. Consider that, for the first individual, we
have $\pi_1^{(0)} = 0.25$ and $\phi_1 = 0.5$. The probability to invite
this individual is therefore $\tilde{\pi}_1^{(0)} = 0.5$. Using this strategy
there might be some individuals $i$ with $\tilde{\pi}_i^{(i-1)} > 1$.
This means that the participation probability of individual
$i$ is too low with respect to $\pi_i^{(i-1)}$: the desired probability
to be included in $s$ for individual $i$ cannot be reached.
For instance, this would happen in the example above for
individual 1 when $\phi_1 = 0.1$ and consequently $\tilde{\pi}_1^{(0)} = 2.5$.
This means that we would have to invite individual 1 two and a
half times to satisfy $\pi_1^{(0)}$. Because we can only invite an
individual once, we restrict all values $\tilde{\pi}_i^{(i-1)}$ to be one or
lower.

Adaptive list sequential sampling method

Usually, $\phi_i$ is not known before the recruitment period
starts. In this section we suggest how $\phi_i$ can be estimated
adaptively during the recruitment period. In addition, we
consider delayed response to the invitation.

For each individual, we have some knowledge about
the willingness to participate before the recruitment
period starts. For example, we might have participation
estimates from a small pilot study or from previously
performed studies. In addition, information from the
invited individuals becomes available during the recruitment
period. Therefore, we propose to use a Bayesian
method to estimate the participation probability of individual
$i$ during the recruitment period, in which we
use both the available prior knowledge and the information
that becomes available during the recruitment
period.

Let $z_i$ be the vector of all observed characteristics
of individual $i$ which are related to the participation
probability. We assume a missing at random type of
mechanism for the participation probabilities, where the
participation probability of individual $i$ only depends on
observed characteristics $z_i$, i.e. $p(s_i = 1 \mid b_i = 1, z_i)$. The
participation probability can be written as

$$p(s_i = 1 \mid b_i = 1, z_i, \alpha, \beta) = \frac{\exp\{\alpha + f(z_i, \beta)\}}{1 + \exp\{\alpha + f(z_i, \beta)\}}, \qquad (5)$$

where $\alpha$ is the intercept term, and $f()$ is a function
of the observed characteristics $z_i$ and the regression
weights $\beta$. Because more information becomes available
during the recruitment period, the participation probability
estimates become more accurate. The vector of
estimated participation probabilities of all $n$ individuals
after the evaluation of individual $i$ is denoted as
$\hat{\phi}^{(i)} = (\hat{\phi}_1^{(i)}, \ldots, \hat{\phi}_n^{(i)})$. We then adapt the inclusion probabilities
using these estimates.

After an invitation has been sent to an individual, it
might take some time to get a response. Let $u_j^{(i)}$ be the
indicator whether individual $j$ has responded to the invitation
at the time individual $i$ is evaluated, where $u_j^{(i)} = 1$ when
we observe $s_j$ and $u_j^{(i)} = 0$ when we do not observe the
participation indicator $s_j$ during the evaluation of individual
$i$. Note that when individual $j$ has not been invited
(i.e. $b_j = 0$), $s_j = 0$ since individual $j$ is not included in
the set of participants. A problem of delayed response is
that we cannot use the update rule from (1) to determine
$\pi_j^{(i)}$ when the participation indicator of the previous individual
is not observed. Consequently, we cannot update
$\pi_j^{(i)}$, which means that our sampling method is less successful
in recruiting a well spread sample. As a solution,
we propose to use the data from all previously invited
individuals, and replace the non-observed participation
indicators with their estimated expected value. We use
this approach in step 1 of the adaptive list sequential
sampling method listed below.
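
The two corrections described above, dividing the inclusion probability by the participation probability (capped at one) and substituting an expected value for a response that has not yet arrived, can be sketched in Python. Both function names are hypothetical, and the handling of a non-positive participation probability is an assumption added only to keep the sketch well defined.

```python
def invitation_probability(pi, phi):
    """Probability of inviting an individual: pi / phi, restricted to one or lower."""
    if phi <= 0.0:
        return 1.0  # assumption: with no chance of participation, always invite
    return min(1.0, pi / phi)

def participation_value(s, b, u, phi_hat):
    """Value used in place of the participation indicator in the updating rule.

    s       : observed participation indicator (ignored when u == 0)
    b       : 1 if the individual was invited, 0 otherwise
    u       : 1 if the response has been observed, 0 otherwise
    phi_hat : current estimate of the participation probability
    """
    return s if u == 1 else b * phi_hat
```

For the small example in the text, `invitation_probability(0.25, 0.5)` gives 0.5, while `invitation_probability(0.25, 0.1)` is capped at 1.0 instead of 2.5.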

Before we start the adaptive list sequential sampling
method, we specify the vector $\pi^{(0)}$, which contains
the initial probabilities of being included in $s$ for
every individual $i$ in $D$. The desired number of individuals
in $s$ is $m = \sum_{i=1}^{n} \pi_i^{(0)}$, where $m$ is a positive integer.
The first individual from $D$ is invited with the probability
$\tilde{\pi}_1^{(0)} = \pi_1^{(0)} / \hat{\phi}_1^{(0)}$, where $\hat{\phi}_1^{(0)}$ is an initial guess of the
participation probability of the first individual. All other
individuals from $D$ are invited in a sequential way, where
the steps of the adaptive list sequential sampling method
for individual $i = 2, \ldots, n$ are

1. Calculate $\pi_i^{(i-1)}$

To deal with delayed response to the invitation, we propose
to use a modified version of the column-wise updating
rule proposed by Bondesson and Thorburn [12].
We calculate $\pi_i^{(i-1)}$ by iterating over $k = 1, 2, \ldots, i-1$,
where

$$\pi_i^{(k)} = \pi_i^{(k-1)} - \left(s_k - \pi_k^{(k-1)}\right) w_k^{(i)}, \qquad (6)$$

and $w_k^{(i)}$ is calculated as

$$w_k^{(i)} = \min\left(\tilde{w}_k^{(i)}, \frac{\pi_i^{(k-1)}}{1 - \pi_k^{(k-1)}}, \frac{1 - \pi_i^{(k-1)}}{\pi_k^{(k-1)}}\right).$$

The weight $w_k^{(i)}$ determines the effect of $s_k$ on $\pi_i^{(k)}$ and
therefore also $\pi_i^{(i-1)}$. The choice of preliminary weights
$\tilde{w}_k^{(i)}$ is discussed in the previous section. Because (6) still
requires the observed indicators $s_1, s_2, \ldots, s_{i-1}$, we modify
(6) to deal with delayed response to the invitation. When
$u_k^{(i)} = 0$, we replace $s_k$ with its estimated expectation
$b_k \hat{\phi}_k^{(i-1)}$, where $\hat{\phi}_k^{(i-1)}$ is the participation probability estimate
of individual $k$ from the previous evaluation $i - 1$.
The delayed response adjusted column-wise updating rule
from (6) is

$$\pi_i^{(k)} = \begin{cases} \pi_i^{(k-1)} - \left(s_k - \pi_k^{(k-1)}\right) w_k^{(i)} & \text{if } u_k^{(i)} = 1, \\ \pi_i^{(k-1)} - \left(b_k \hat{\phi}_k^{(i-1)} - \pi_k^{(k-1)}\right) w_k^{(i)} & \text{if } u_k^{(i)} = 0. \end{cases}$$

2. Calculate $\tilde{\pi}_i^{(i-1)}$

Decide whether individual $i$ should be invited to participate
in the study, where $b_i = 1$ if the individual is invited
and $b_i = 0$ if not. This decision is based on the probability
of being invited,

$$\tilde{\pi}_i^{(i-1)} = \frac{\pi_i^{(i-1)}}{\hat{\phi}_i^{(i-1)}}, \qquad (7)$$

where $\hat{\phi}_i^{(i-1)}$ is the participation probability estimated
from the previous evaluation $i - 1$. We draw the decision
to invite individual $i$ from a Bernoulli distribution with
$p(b_i = 1) = \tilde{\pi}_i^{(i-1)}$.

3. Update the vector $\hat{\phi}^{(i)}$

Let $R^{(i)} = \{r : b_r = 1, u_r^{(i)} = 1, r \in D\}$ be the set of all
$m_i$ individuals that responded to the invitation to participate.
Each individual from $R^{(i)}$ is described by $r = (s, z)$,
where $s = 1$ when invitee $r$ participates and $s = 0$ otherwise,
and $z$ is a vector of known characteristics. The
participation probability of individual $k$ is defined as in (5).
Because we might have some a-priori knowledge about
the intercept $\alpha$ and the regression weights $\beta$, we use
Bayesian inference to estimate the posterior distribution
$g(\alpha, \beta \mid R^{(i)})$, i.e.

$$g(\alpha, \beta \mid R^{(i)}) = \frac{h(R^{(i)} \mid \alpha, \beta) f(\alpha, \beta \mid \theta)}{\int h(R^{(i)} \mid \alpha, \beta) f(\alpha, \beta \mid \theta) \, d(\alpha, \beta)}, \qquad (8)$$

where $\theta$ is a vector of parameters, and $f()$ is the prior distribution
of $(\alpha, \beta)$. The likelihood of $R^{(i)}$ given $(\alpha, \beta)$ is

$$h(R^{(i)} \mid \alpha, \beta) = \prod_{\ell=1}^{m_i} p(s_\ell = 1 \mid z_\ell, \alpha, \beta)^{s_\ell} \{1 - p(s_\ell = 1 \mid z_\ell, \alpha, \beta)\}^{1 - s_\ell},$$

where $p(s_\ell = 1 \mid z_\ell, \alpha, \beta)$ is given by (5). Following (8) we
update the vector of estimated participation probabilities
$\hat{\phi}^{(i)}$, where for individual $k = 1, \ldots, n$

$$\hat{\phi}_k^{(i)} = \int_{(\alpha, \beta)} p(s_k = 1 \mid z_k, \alpha, \beta) \, g(\alpha, \beta \mid R^{(i)}) \, d(\alpha, \beta).$$

To estimate $\hat{\phi}_k^{(i)}$, we can use quadrature or MCMC
methods. The values of $\theta$ depend on the amount of prior
knowledge that is available before the recruitment period
starts. For instance, we can assume that $(\alpha, \beta)$ is sampled
from some flat distribution with large variance when no
prior knowledge is available.
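
Step 3 can be sketched with a simple random-walk Metropolis sampler, which is one possible MCMC method and not necessarily the one used by the authors. The sketch assumes a linear predictor $\alpha + z\beta$, independent normal priors with mean zero and variance `prior_var`, and hypothetical tuning values for the number of iterations, burn-in and step size.

```python
import numpy as np

def invlogit(v):
    return 1.0 / (1.0 + np.exp(-v))

def log_posterior(theta, z, s, prior_var):
    """Log posterior (up to a constant) of theta = (alpha, beta) for logistic participation."""
    eta = theta[0] + z @ theta[1:]
    loglik = np.sum(s * eta - np.logaddexp(0.0, eta))   # stable Bernoulli log-likelihood
    logprior = -0.5 * np.sum(theta ** 2) / prior_var    # independent N(0, prior_var) priors
    return loglik + logprior

def estimate_participation(z_resp, s_resp, z_all, rng, prior_var=100.0,
                           n_iter=4000, burn_in=1000, step=0.15):
    """Posterior mean participation probability for every row of z_all.

    z_resp, s_resp : characteristics and responses of invitees who have answered
    z_all          : characteristics of all individuals in the population
    """
    d = z_resp.shape[1] + 1
    theta = np.zeros(d)
    logp = log_posterior(theta, z_resp, s_resp, prior_var)
    phi_sum = np.zeros(len(z_all))
    kept = 0
    for it in range(n_iter):
        proposal = theta + step * rng.standard_normal(d)
        logp_new = log_posterior(proposal, z_resp, s_resp, prior_var)
        if np.log(rng.random()) < logp_new - logp:      # Metropolis acceptance step
            theta, logp = proposal, logp_new
        if it >= burn_in:
            phi_sum += invlogit(theta[0] + z_all @ theta[1:])
            kept += 1
    return phi_sum / kept                               # Monte Carlo estimate of the integral
```

In an actual recruitment run this function would be called after each evaluation with the responses received so far; in practice the sampler would need convergence checks that are omitted here.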

Simulations 

We illustrated the performance of the adaptive list 
sequential sampling method with two simulations. In 
these two simulations, we created populations with 
unknown heterogeneous willingness to participate and 
delayed response to the invitation. The first simulation 
was focused on recruiting a well spread, representative 
set of participants. In the second simulation, we investi- 
gated stratified sampling from a population in which some 
subgroups were over-represented. 

Simulation 1 

Consider a population $D$ of size $n = 4000$ from which we
drew a random sample without replacement of size $m = 400$
with the adaptive list sequential sampling method.
To recruit a representative sample from the population,
we assigned equal inclusion probabilities to all individuals
from the population, i.e. $\pi_i^{(0)} = m/n = 0.1$ for
$i = 1, \ldots, n$. When the sample is well spread, the distribution
of the auxiliary characteristics $x$ should be approximately
similar in the population and the sample.

The data were generated as follows. The vector $z_i$
was drawn from a multivariate normal distribution with
means zero and covariances zero. The probability of
positively responding to the invitation was
$p(s_i = 1 \mid b_i = 1, z_i) = \text{invlogit}[\alpha + z_i \beta]$, where invlogit
denotes the inverse logit transformation, $\alpha = 1$, and
$\beta = (0.3, -0.7, 0.1, 0.4)$. The response was drawn from
a Bernoulli distribution with $p(s_i = 1 \mid b_i = 1, z_i)$. In
addition, for individual $i$, delayed response to the invitation
was simulated by drawing time $t_i$ from a Poisson
distribution with expectation 15. Individual $i$ responded
to the invitation after the evaluation of individual $i + t_i$.
Thus if $t_i = 0$, individual $i$ responded immediately to the
invitation.
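
The delay mechanism can be sketched in Python as follows. The function name is hypothetical, individuals are numbered from 1, and the sketch reads "responded after the evaluation of individual $i + t_i$" as meaning that the response is available from the evaluation of individual $i + t_i + 1$ onwards; that reading is an assumption.

```python
import numpy as np

def response_observed(delay, i):
    """Indicator, per individual j = 1..n, of whether the response of j is
    available when individual i is evaluated (delay[j - 1] holds t_j)."""
    j = np.arange(1, len(delay) + 1)
    return ((j + delay) < i).astype(int)

# Delays as in the simulation: Poisson distributed with expectation 15
rng = np.random.default_rng(0)
delays = rng.poisson(15, size=4000)
```

Individuals who were never invited have a known indicator of zero regardless of this response indicator, so in a full simulation it only matters for invited individuals.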

For individual $i$, the characteristics $x_i$ were drawn from
a multivariate normal distribution with means zero, variances
one, and covariance matrix

$$\begin{pmatrix} 1.00 & 0.20 & -0.50 & 0.30 \\ 0.20 & 1.00 & 0.20 & -0.40 \\ -0.50 & 0.20 & 1.00 & -0.20 \\ 0.30 & -0.40 & -0.20 & 1.00 \end{pmatrix}.$$

To obtain a well spread and representative sample, we
used the adaptive list sequential method. To satisfy (3), we
used the Mahalanobis distance to quantify the distance
between individuals. We ranked the distances in ascending
order and used the order to determine the preliminary
weights $\tilde{w}_k^{(i)}$, for $i = 1, \ldots, n$ and $k \ne i$. Using (4), we
specified the following adaptive list sequential sampling
methods with different characteristics.

Simple random sampling: Assign zero to all weights
$\tilde{w}_k^{(i)}$. Consequently, $w_k^{(i)} = 0$ and therefore $\pi_i^{(i-1)} = \pi_i^{(0)}$.
With these weights, we used the initial inclusion
probability $\pi_i^{(0)}$ to determine whether we should invite
individual $i$.

Adjusted sampling 1: The inclusion probability of individual
$i$, $\pi_i^{(i-1)}$, was equally influenced by all $n - 1 = 3999$
other individuals by using the preliminary weights
$\tilde{w}_k^{(i)} = 1/3999$.

Adjusted sampling 2: Only the 50 nearest neighbors of
individual $i$ influenced the inclusion probability $\pi_i^{(i-1)}$ by
using the preliminary weights

$$\tilde{w}_k^{(i)} = \begin{cases} 1/50 & \text{if } c_k^{(i)} \le 50, \\ 0 & \text{otherwise.} \end{cases}$$

We used an estimated participation probability to deal
with non-participation. Two different approaches to estimate
the participation probability were evaluated. The
first approach was to use all available data to estimate
the participation probability, i.e.
$\hat{\phi}_i^{(i-1)} = p(s_i = 1 \mid b_i = 1, z_i) = \text{invlogit}[\alpha + z_i \beta]$. With the second
approach, we assumed that $z_i$ had no impact on the
participation probability, i.e.
$\hat{\phi}_i^{(i-1)} = p(s_i = 1 \mid b_i = 1, z_i) = \text{invlogit}[\alpha]$. The second approach was used
to investigate whether misspecifying
$\hat{\phi}_i^{(i-1)}$ had a large impact on how well the sample was
spread.

We assumed that we had no prior knowledge about the
participation probability before the recruitment period
started. Therefore flat, non-informative priors were used
for $\alpha$ and all regression weights $\beta$ by assuming they followed
normal distributions with means zero and variance
100. Because we assumed zero means, the initial estimated
participation probabilities were 50%, i.e. $\hat{\phi}_i^{(0)} = 0.5$ for
$i = 1, \ldots, n$.

We quantified how well a sample was spread with the
following measure based on Voronoi polytopes, suggested
by Grafström and Lundström [10]. Let individual $i \in s$,
i.e. individual $i$ is included in the set of participants $s$. The
Voronoi polytope $v_i$ consists of all individuals $j$ from the
population $D$ for which $d(x_i, x_j) \le d(x_k, x_j)$, for all other
individuals $k \in s$. Note that when $d(x_i, x_j) = d(x_k, x_j)$,
individual $j$ is included in both polytopes $v_i$ and $v_k$, but
weighted with 1/2.

Let $q_i$ be the sum of initial inclusion probabilities of the
individuals in $v_i$,

$$q_i = \sum_{j \in v_i} \pi_j^{(0)}.$$

Grafström and Lundström showed that a sample can be
considered to be well spread if $q_i$ is one or close to one for
all polytopes $v_i$. Therefore, a measure to quantify how well
spread a sample is, is

$$R = \frac{1}{m} \sum_{i \in s} (q_i - 1)^2,$$

where a low $R$ corresponds to a well spread sample. To
investigate how well the adaptive list sequential sampling
methods performed in recruiting a well spread sample, the
simulation was performed 1000 times. We calculated the
mean and variance of $R$, and the average number of recruited
participants. Note that the best adaptive list sequential
sampling method should give us a set of approximately
400 participants with a low $R$ in every simulation.
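
A Voronoi-based spread measure of this kind can be computed as in the following Python sketch. Several choices here are assumptions: the function name, the use of Euclidean distance as a stand-in for the Mahalanobis distance, the equal splitting of ties among all tied participants, and the form of the summary, which is taken to be the mean squared deviation of the polytope sums from one over the participants.

```python
import numpy as np

def spread_measure(x, pi0, sample_idx):
    """Return (R, q): the spread summary and the polytope sums q_i.

    x          : auxiliary variables of the whole population, shape (n, p)
    pi0        : initial inclusion probabilities, length n
    sample_idx : indices of the participants
    """
    x = np.asarray(x, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    sample_idx = np.asarray(sample_idx)
    # distance from every population member to every participant
    diff = x[:, None, :] - x[sample_idx][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # shape (n, number of participants)
    # each population member belongs to the polytope of its nearest participant(s)
    nearest = np.isclose(dist, dist.min(axis=1, keepdims=True))
    share = nearest / nearest.sum(axis=1, keepdims=True)  # ties are split equally
    q = (share * pi0[:, None]).sum(axis=0)
    return float(np.mean((q - 1.0) ** 2)), q
```

For a population of four equally spaced points with inclusion probability 0.5 each, selecting the two outer points gives polytope sums of one and a summary of zero, whereas selecting two adjacent points at one end gives sums of 0.5 and 1.5.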

Simulation 2 

In simulation 2, we considered a population $D$ of size $n = 5000$,
in which each individual was described by a categorical
auxiliary variable and an unobserved binary outcome
of interest $y_i$. The auxiliary variable $x_i$ had five possible
values $g$. The main goal of this simulation was to estimate
the sum of the outcome $y$ in the population, denoted as
$Y = \sum_{i=1}^{n} y_i$, with a set of participants in which we can
measure $y$. Moreover, we had resources to measure $y$ in a
set of participants of size $m = 500$. The set of participants
was obtained with an adaptive list sequential sampling
method where we dealt with non-participation during the
recruitment period.

Individuals in different subgroups had different participation
probabilities and different frequencies of the
outcome $y$. The characteristics of the populations were

$$\begin{array}{rcl} g &=& (1, \; 2, \; 3, \; 4, \; 5) \\ p(x_i = g) &=& (40\%, \; 20\%, \; 20\%, \; 10\%, \; 10\%) \\ p(s_i = 1 \mid b_i = 1, x_i = g) &=& (50\%, \; 60\%, \; 70\%, \; 80\%, \; 90\%) \\ p(y_i = 1 \mid x_i = g) &=& (10\%, \; 20\%, \; 30\%, \; 40\%, \; 50\%) \end{array}$$

where $p(s_i = 1 \mid b_i = 1, x_i = g)$ was the participation probability
of individual $i$ given $x_i = g$, i.e. for individual $i$ the
probability of participating depended on $x_i$. The response
to an invitation was drawn from a Bernoulli distribution
with probability $p(s_i = 1 \mid b_i = 1, x_i = g)$. Moreover,
$E(Y) = n \sum_{g=1}^{5} p(y_i = 1 \mid x_i = g) \, p(x_i = g) = 1150$.
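
The expected population total follows directly from the group proportions and the outcome frequencies, as the short Python check below illustrates (the variable names are chosen here for illustration).

```python
n = 5000
p_group = [0.40, 0.20, 0.20, 0.10, 0.10]     # p(x_i = g), g = 1..5
p_outcome = [0.10, 0.20, 0.30, 0.40, 0.50]   # p(y_i = 1 | x_i = g)

# E(Y) = n * sum over groups of p(y = 1 | g) * p(g)
expected_total = n * sum(pg * py for pg, py in zip(p_group, p_outcome))
```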

The individuals in the set of participants $s$ were used to
estimate $Y$, denoted as $\hat{Y}_{HT}$, where we used the Horvitz-Thompson
estimator and its variance [14-16] to determine
$\hat{Y}_{HT}$. The estimate $\hat{Y}_{HT}$ was calculated as

$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i^{(0)}}, \qquad (9)$$

where $\pi_i^{(0)}$ was the desired probability of being included in
the set of participants $s$, specified before the recruitment
period started. The variance of $\hat{Y}_{HT}$ was approximated
with

$$\hat{V}(\hat{Y}_{HT}) = \sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij}^{(0)} - \pi_i^{(0)} \pi_j^{(0)}}{\pi_{ij}^{(0)}} \, \frac{y_i}{\pi_i^{(0)}} \, \frac{y_j}{\pi_j^{(0)}},$$

where $\pi_{ij}^{(0)}$ is the second order joint-inclusion probability
of the $i$th and $j$th individuals in $s$, i.e. $\pi_{ij}^{(0)} = p(s_i = 1, s_j = 1)$.
To determine $\pi_{ij}^{(0)}$, we used the sample based
approximation technique proposed by Hajek [17,18].
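
The Horvitz-Thompson estimate and the standard Horvitz-Thompson variance estimator can be sketched in Python as follows. The function name is hypothetical, and the sketch takes the joint inclusion probabilities as given (with the first order probabilities on the diagonal); the Hajek approximation used to obtain them is not implemented here.

```python
import numpy as np

def horvitz_thompson(y, pi, pi_joint):
    """Horvitz-Thompson estimate of a total and its estimated variance.

    y        : outcomes of the participants
    pi       : first order inclusion probabilities of the participants
    pi_joint : matrix of joint inclusion probabilities, diagonal equal to pi
    """
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pi_joint = np.asarray(pi_joint, dtype=float)
    expanded = y / pi                                   # y_i / pi_i
    total = expanded.sum()
    delta = (pi_joint - np.outer(pi, pi)) / pi_joint    # (pi_ij - pi_i pi_j) / pi_ij
    variance = expanded @ delta @ expanded              # double sum over the sample
    return total, variance
```

As a check, for simple random sampling of 2 out of 4 individuals the joint probability is 1/6, and outcomes (1, 3) give an estimated total of 8 with an estimated variance of 8, which agrees with the usual formula $N^2 (1 - n/N) s^2 / n$.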

The set of participants $s$ was obtained with the adaptive
list sequential sampling method. Before the recruitment
period started, we specified the vector $\pi^{(0)}$. We considered
a vector $\pi^{(0)}$ in which the probability of being
included in $s$ was inversely proportional to the size of group $g$ in
the population. Because not all groups were observed with
the same frequency in $D$, we oversampled the smaller subgroups
in such a way that each group $g$ was observed
with similar frequency in $s$. For each invited individual
with $x_i = 1$, we have to invite 2, 2, 4, and 4 individuals
with respectively $x_i = 2, 3, 4, 5$ to obtain an equal number
of individuals from each group in $s$. Therefore, depending
on the value of $x_i$, we used the following probabilities for
individual $i$

$$\pi_i^{(0)} = \begin{cases} 0.05 & \text{if } x_i = 1, \\ 0.10 & \text{if } x_i = 2 \text{ or } x_i = 3, \\ 0.20 & \text{if } x_i = 4 \text{ or } x_i = 5. \end{cases}$$
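
That these probabilities lead to equal expected group sizes can be checked with a few lines of Python. The group sizes below are the expected sizes implied by the group proportions and $n = 5000$; the realised sizes in a simulated population would vary around them.

```python
# Expected group sizes for n = 5000 with proportions 40%, 20%, 20%, 10%, 10%
group_size = {1: 2000, 2: 1000, 3: 1000, 4: 500, 5: 500}
pi0 = {1: 0.05, 2: 0.10, 3: 0.10, 4: 0.20, 5: 0.20}

# Expected number of participants per group and in total
expected_per_group = {g: group_size[g] * pi0[g] for g in group_size}
expected_sample_size = sum(expected_per_group.values())
```

Each group is expected to contribute 100 participants, giving the desired total of 500.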



Note that we could also use stratified sampling to get
our desired set of participants because we only have five
disjoint groups. However, when we have a large number
of groups, stratification becomes impracticable. A large
number of groups is no problem for the (adaptive) list
sequential sampling design, if it is possible to specify a distance
measure between individuals (see (3)). With $\pi^{(0)}$, we
expected to have an equal number of individuals for each
subgroup $g$ in the set of participants.

We considered two adaptive list sequential methods to
recruit the sample.

Simple random sampling: Assign zero to all weights
$\tilde{w}_k^{(i)}$. Therefore $\pi_i^{(i-1)} = \pi_i^{(0)}$.

Adjusted sampling: To recruit a well spread sample, the
inclusion probability of individual $i$ should only be influenced
by individuals located in the same group. Therefore,
we used the following preliminary weights

$$\tilde{w}_k^{(i)} = \begin{cases} 1/(n_g - 1) & \text{if } x_i = g \text{ and } x_k = g, \\ 0 & \text{otherwise,} \end{cases}$$

where $n_g$ is the number of individuals in group $g$.

For both adaptive list sequential sampling methods, we
used the following model to describe the participation
probability

$$p(s_i = 1 \mid b_i = 1, x_i = g, \beta) = \frac{\exp[\beta_g I(x_i = g)]}{1 + \exp[\beta_g I(x_i = g)]},$$

where $\beta_g$ is the regression weight for group g. Because we
assumed we had no a priori information about the participation
probabilities, we used non-informative priors
for $\beta$ by sampling all five parameters $\beta_g$ from a normal
distribution with mean zero and variance 100. For individual
i, delayed response to the invitation was simulated by
drawing a time $t_i$ from a Poisson distribution with expectation
15. Individual i responded to the invitation after the
evaluation of individual $i + t_i$.
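
The response mechanism described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' simulation code: the function names, the seed, and the use of draws from the normal prior as example regression weights are all hypothetical, and the Poisson draw uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random


def participation_probability(beta_g):
    """Inverse logit of the group-specific regression weight."""
    return 1.0 / (1.0 + math.exp(-beta_g))


def draw_poisson(rng, mean):
    """Knuth's multiplication method; adequate for small means such as 15."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_invitation(i, group, beta, rng, mean_delay=15):
    """Return (participates, index of the evaluation after which the response is known)."""
    participates = rng.random() < participation_probability(beta[group])
    return participates, i + draw_poisson(rng, mean_delay)


rng = random.Random(2014)  # arbitrary seed
# Example weights drawn from a normal distribution with variance 100 (sd 10).
beta = {g: rng.gauss(0.0, 10.0) for g in range(1, 6)}
participates, response_index = simulate_invitation(i=100, group=3, beta=beta, rng=rng)
print(participates, response_index)
```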

The simulations were performed 1000 times and we
calculated the bias, MSE, and coverage of $Y_{ht}$ for both
adaptive list sequential methods.
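
The three performance measures can be computed from replicated estimates roughly as sketched below. The sketch assumes that a variance estimate is available for each replication, that the 95% interval is the normal approximation (estimate plus or minus 1.96 standard errors), and that bias is defined as estimate minus true value; none of these details is confirmed by the text, and the function name `summarize` is hypothetical.

```python
import math


def summarize(estimates, variance_estimates, true_value, z=1.96):
    """Bias, mean squared error, and interval coverage over replications."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    covered = sum(
        1
        for e, v in zip(estimates, variance_estimates)
        if abs(e - true_value) <= z * math.sqrt(v)
    )
    return {"bias": bias, "mse": mse, "coverage": covered / n}


# Toy input: two replications around a true value of 10.
result = summarize([9.0, 11.0], [1.0, 1.0], 10.0)
print(result)
```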

Results 

Simulation 1 

The results from simulation 1 are summarized
in Table 1. The results showed that the adaptive list
sequential sampling method with the adjusted sampling 2
approach performed best. In this approach, the participation
probability of individual i was only influenced by the
participation indicator of the 50 nearest neighbors. The recruited
sets of participants were better spread than with the other
sampling approaches, as reflected by the lower median and
spread of R.



Hof et al. BMC Medical Research Methodology 2014, 14:81
http://www.biomedcentral.com/1471-2288/14/81






Table 1 95% confidence interval of R and the number of participants in simulation 1

Estimated participation probability: $\mathrm{invlogit}[\alpha + z_i \beta]$

                          Measure R                   Number of participants
Sampling method           2.5%     50%      97.5%     Mean    Standard deviation
Simple random sampling    0.192    0.238    0.304     401     18
Adjusted sampling 1       0.199    0.241    0.298     397     11
Adjusted sampling 2       0.157    0.189    0.225     397     11

Estimated participation probability: $\mathrm{invlogit}[\alpha]$

                          Measure R                   Number of participants
Sampling method           2.5%     50%      97.5%     Mean    Standard deviation
Simple random sampling    0.188    0.230    0.304     405     18
Adjusted sampling 1       0.197    0.238    0.291     400     11
Adjusted sampling 2       0.154    0.184    0.225     400     11

Using all the auxiliary characteristics $z_i$ to estimate the
participation probability of individual i, the simple random
sampling approach resulted in a median R of 0.238
(95% confidence interval: 0.192-0.304). The mean number
of participants with the simple random sampling
approach was about 401 (95% confidence interval:
365-436). For the adjusted sampling 1 approach,
similar results were found for R; that is, on average, the sets
of participants obtained with the simple random sampling
approach and the adjusted sampling 1 approach
were comparably well spread. With the
adjusted sampling 1 approach, the average size of the set
of participants was 397 (95% confidence interval:
376-418). However, compared to the simple random sampling
approach, the variation in the size of the set of participants
was considerably lower with the adjusted sampling 1
approach (standard deviations of 18 and 11, respectively).

On average, a set of participants recruited with the
adjusted sampling 2 approach was better spread than with
the other two approaches. Not only was the median R
lower (0.189), the spread around the median was also smaller than
with the other two approaches (95% confidence interval:
0.157-0.225). The mean size of the set of participants with
the adjusted sampling 2 approach was 397 (95% confidence
interval: 376-418), which was comparable to the
adjusted sampling 1 approach.

Interestingly, the performances of all three approaches
remained similar when we ignored the auxiliary characteristics
$z_i$ in the estimation of the participation probability
of individual i. Since fitting a model with just an intercept
gave comparable results to the more complicated model
where we also included $z_i$, the results suggested that the
adaptive list sequential sampling method was robust to
misspecification of the participation probability model.

Simulation 2 

The results from simulation 2 are summarized in
Table 2. Using the set of participants obtained with the
simple random sampling approach resulted in a biased
estimate of $Y_{ht}$. With the adjusted sampling approach,
$Y_{ht}$ was more accurately estimated. This was reflected
in the bias (+31 for simple random sampling and +1 for
adjusted sampling) and the variance of the estimate (7995
for simple random sampling and 7817 for adjusted sampling).
Consequently, the coverage of the 95% confidence
interval was better when we used the adjusted sampling
approach (0.86 for simple random sampling and 0.92 for
adjusted sampling).
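
As a back-of-the-envelope check, the standard identity MSE = variance + bias^2 can be applied to the rounded values reported for simulation 2. The Python sketch below is our own arithmetic, not a calculation reported by the authors, and it assumes that the reported bias and mean squared error were computed over the same replications; the dictionary keys and function name are hypothetical.

```python
# Rounded values as reported for simulation 2.
reported = {
    "simple random sampling": {"bias": 31, "mean_estimated_variance": 7995, "mse": 14457},
    "adjusted sampling": {"bias": 1, "mean_estimated_variance": 7817, "mse": 10288},
}


def implied_sampling_variance(mse, bias):
    """Variance implied by the identity MSE = variance + bias squared."""
    return mse - bias ** 2


implied = {
    method: implied_sampling_variance(row["mse"], row["bias"])
    for method, row in reported.items()
}
for method, value in implied.items():
    print(method, value, reported[method]["mean_estimated_variance"])
```

In both rows the implied sampling variance is larger than the mean estimated variance, which may help explain why the coverage stays below the nominal 95%. This interpretation is tentative.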

Table 2 Estimated frequency Y derived from the set of participants in simulation 2

Sampling method           Estimated      Bias               Variance           Mean squared error      Coverage of the
                          $E(Y_{ht})$    $E(Y - Y_{ht})$    $E[V(Y_{ht})]$     $E[(Y_{ht} - Y)^2]$     95% confidence interval
Simple random sampling    1181           31                 7995               14457                   0.86
Adjusted sampling         1151           1                  7817               10288                   0.92

Discussion

In this paper, we developed an adaptive list sequential
sampling method for situations in which a random sample
from the population is required and the willingness to
participate varies between individuals and is not known
beforehand. Our adaptive list sequential sampling method
requires that the characteristics related to the participation
probability are known for all individuals. With simulations, we
showed that the adaptive list sequential sampling method
could successfully deal with unknown heterogeneous
participation probabilities.

In our adaptive list sequential sampling method, we
evaluate each individual from the population only once.
Therefore, we only have one opportunity to decide
whether to invite an individual or not. When we overestimate
the participation probability for all individuals from
the population, we end up with a set of participants that
is too small. A simple solution for this problem would be to
re-evaluate non-invited individuals until the desired number
of participants in the study has been reached.
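
The re-evaluation idea can be illustrated with a deliberately simplified Python sketch. It is not the list sequential procedure itself: every individual is invited with the same fixed probability, responses are assumed to arrive immediately, and the function name, parameters, and numbers are hypothetical.

```python
import random


def recruit_with_reevaluation(n, invite_prob, participation_prob, target, rng, max_passes=20):
    """Pass over the not-yet-invited individuals until the target size is reached."""
    not_invited = list(range(n))
    participants = []
    for _ in range(max_passes):
        if len(participants) >= target or not not_invited:
            break
        remaining = []
        for position, i in enumerate(not_invited):
            if len(participants) >= target:
                # Target reached: everyone not yet evaluated in this pass stays uninvited.
                remaining.extend(not_invited[position:])
                break
            if rng.random() < invite_prob:
                if rng.random() < participation_prob:
                    participants.append(i)
            else:
                remaining.append(i)
        not_invited = remaining
    return participants, not_invited


participants, left_over = recruit_with_reevaluation(
    n=1000, invite_prob=0.1, participation_prob=0.5, target=100, rng=random.Random(0)
)
print(len(participants), len(left_over))
```

Individuals who were invited and declined are never invited again in this sketch; only those who were passed over are re-evaluated.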

The simulations suggested that the adaptive list sequential
sampling method is robust to misspecification of
the participation probability model. Just using an intercept
term to describe the participation probability seems
to work quite well. However, the extent to which the adaptive
list sequential sampling method can deal with wrong
participation probability estimates was not investigated
in this paper. In addition, extremely delayed responses to
the invitation influence the performance of the
list sequential sampling method. Further research is necessary
to determine in which situations the adaptive list
sequential sampling method succeeds or fails to recruit
a well spread set of participants.
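
To make the intercept-only idea concrete, the Python sketch below estimates a single participation probability from the responses received so far and inflates the invitation probability accordingly. It is a stand-in under stated assumptions, not the method of the text: a smoothed proportion replaces the Bayesian logistic regression, the delay is fixed rather than Poisson distributed, no weights are used, and all names and numbers are hypothetical.

```python
import random


def staged_recruitment(n, target_prob, true_p, delay, rng):
    """Evaluate individuals 0..n-1 once, updating the participation estimate.

    target_prob: desired probability of ending up in the set of participants.
    true_p: participation probability used to simulate responses (unknown in practice).
    delay: number of evaluations before a response becomes known.
    """
    arrivals = {}  # evaluation index -> outcomes that become known at that index
    responses = 0
    accepted = 0
    participants = []
    p_hat = 0.5
    for i in range(n):
        for outcome in arrivals.pop(i, []):
            responses += 1
            accepted += outcome
        # Intercept-only estimate with add-one smoothing.
        p_hat = (accepted + 1) / (responses + 2)
        invite_prob = min(1.0, target_prob / p_hat)
        if rng.random() < invite_prob:
            outcome = 1 if rng.random() < true_p else 0
            if outcome:
                participants.append(i)
            arrivals.setdefault(i + max(1, delay), []).append(outcome)
    return participants, p_hat


participants, p_hat = staged_recruitment(
    n=20000, target_prob=0.05, true_p=0.4, delay=15, rng=random.Random(7)
)
print(len(participants), round(p_hat, 3))
```

Dividing the target probability by the estimated participation probability is what corrects the invitation step for expected refusals; responses that would arrive after the last evaluation are simply never used.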

A problem that was not considered here was the use
of multiple invitation techniques in sampling designs. For
instance, there could be individuals in the population that
have a low willingness to participate when they are invited
by letter, but a much larger willingness when invited
by telephone. Our method can be adapted to this situation
by extending step 3 of our algorithm to estimate multiple
participation probabilities with multiple logistic regression
models.

Conclusions 

We showed that correcting for heterogeneity in the participation
probability during the recruitment period is an
effective approach when we have no or partial knowledge
of the willingness to participate in population studies. By
inviting individuals from the population in stages, the
participation probability can be estimated and used in the
sampling procedure.